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Abstract 

We build up the mathematical connection between the “Expectation-Maximization” (EM) algorithm and 
gradient-based approaches for maximum likelihood learning of finite Gaussian mixtures. We show that the 
EM step in parameter space is obtained from the gradient via a projection matrix P, and we provide an 
explicit expression for the matrix. We then analyze the convergence of EM in terms of special properties 
of P and provide new results analyzing the effect that P has on the likelihood surface. Based on these 
mathematical results, we present a comparative discussion of the advantages and disadvantages of EM 
and other algorithms for the learning of Gaussian mixture models. 
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1 Introduction 


The “Expectation-Maximization” (EM) algorithm is a 
general technique for maximum likelihood (ML) or max¬ 
imum a posteriori (MAP) estimation. The recent em¬ 
phasis in the neural network literature on probabilistic 
models has led to increased interest in EM as a possible 
alternative to gradient-based methods for optimization. 
EM has been used for variations on the traditional theme 
of Gaussian mixture modeling (Ghahramani & Jordan, 
1994; Nowlan, 1991; Xu & Jordan, 1993a, b; Tresp, Ah¬ 
mad & Neuneier, 1994; Xu, Jordan & Hinton, 1994) and 
has also been used for novel chain-structured and tree- 
structured architectures (Bengio & Frasconi, 1995; Jor¬ 
dan & Jacobs, 1994). The empirical results reported in 
these papers suggest that EM has considerable promise 
as an optimization method for such architectures. More¬ 
over, new theoretical results have been obtained that 
link EM to other topics in learning theory (Amari, 1994; 
Jordan & Xu, 1993; Neal & Hinton, 1993; Xu & Jordan, 
1993c; Yuille, Stolorz & Utans, 1994). 

Despite these developments, there are grounds for 
caution about the promise of the EM algorithm. One 
reason for caution comes from consideration of theoret¬ 
ical convergence rates, which show that EM is a first 
order algorithm . 1 More precisely, there are two key re¬ 
sults available in the statistical literature on the con¬ 
vergence of EM. First, it has been established that un¬ 
der mild conditions EM is guaranteed to converge to 
a local maximum of the log likelihood l (Boyles, 1983; 
Dempster, Laird & Rubin, 1977; Redner & Walker, 
1984; Wu, 1983). (Indeed the convergence is monotonic: 
/(0G+ 1 )) > /(0(G), where 0(G is the value of the pa¬ 
rameter vector 0 at iteration k.) Second, considering 
EM as a mapping 0G+ 1 ) = M(0(©) with fixed point 
0 * = M( 0 *), we have ©G+ 1 )-©* ~ dM g (®*) ( 0 (G _ 0 *) 
when ©G+ 1 ) is near 0 *, and thus 
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almost surely. That is, EM is a first order algorithm. 

The first-order convergence of EM has been cited in 
the statistical literature as a major drawback. Red¬ 
ner and Walker (1984), in a widely-cited article, argued 
that superlinear (quasi-Newton, method of scoring) and 
second-order (Newton) methods should generally be pre¬ 
ferred to EM. They reported empirical results demon¬ 
strating the slow convergence of EM on a Gaussian mix¬ 
ture model problem for which the mixture components 
were not well separated. These results did not include 
tests of competing algorithms, however. Moreover, even 
though the convergence toward the “optimal” parameter 
values was slow in these experiments, the convergence in 
likelihood was rapid. Indeed, Redner and Walker ac¬ 
knowledge that their results show that “... even when 


1 An iterative algorithm is said to have a local convergence 
rate of order q > 1 if ||©(* +1 ) - ©*||/||© (fe) - 0*|| 9 < r + 
o(||0(G — 0*||) for k sufficiently large. 


the component populations in a mixture are poorly sep¬ 
arated, the EM algorithm can be expected to produce 
in a very small number of iterations parameter values 
such that the mixture density determined by them re¬ 
flects the sample data very well.” In the context of the 
current literature on learning, in which the predictive 
aspect of data modeling is emphasized at the expense of 
the traditional Fisherian statistician’s concern over the 
“true” values of parameters, such rapid convergence in 
likelihood is a major desideratum of a learning algorithm 
and undercuts the critique of EM as a “slow” algorithm. 

In the current paper, we provide a comparative anal¬ 
ysis of EM and other optimization methods. We empha¬ 
size the comparison between EM and other first-order 
methods (gradient ascent, conjugate gradient methods), 
because these have tended to be the methods of choice 
in the neural network literature. However, we also com¬ 
pare EM to superlinear and second-order methods. We 
argue that EM has a number of advantages, including its 
naturalness at handling the probabilistic constraints of 
mixture problems and its guarantees of convergence. We 
also provide new results suggesting that under appropri¬ 
ate conditions EM may in fact approximate a superlin¬ 
ear method; this would explain some of the promising 
empirical results that have been obtained (Jordan & Ja¬ 
cobs, 1994), and would further temper the critique of EM 
offered by Redner and Walker. The analysis in the cur¬ 
rent paper focuses on unsupervised learning; for related 
results in the supervised learning domain see Jordan and 
Xu (in press). 

The remainder of the paper is organized as follows. 
We first briefly review the EM algorithm for Gaussian 
mixtures. The second section establishes a connection 
between EM and the gradient of the log likelihood. We 
then present a comparative discussion of the advantages 
and disadvantages of various optimization algorithms in 
the Gaussian mixture setting. We then present empir¬ 
ical results suggesting that EM regularizes the condi¬ 
tion number of the effective Hessian. The fourth section 
presents a theoretical analysis of this empirical finding. 
The final section presents our conclusions. 


2 The EM algorithm for Gaussian 
mixtures 


We study the following probabilistic model: 

K 

P(x\Q) = (!) 
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where ay > 0 and XyLi a j = 1 and d is the dimension¬ 
ality of the vector x. The parameter vector 0 consists 
of the mixing proportions ay, the mean vectors rrij , and 
the covariance matrices Xy . 

Given K and given N independent, identically dis¬ 
tributed samples {ajWjjV, we obtain the following log 



likelihood: 2 
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which can be optimized via the followin g iterative algo- 
rithm (see, e. g , Dempster, Laird & Rubin, 1977): 


of the matrix B, and “<g>” denotes the Kronecker prod¬ 
uct. Moreover, given the constraints EyLi = 1 ani 1 
a E > 0, pW is a positive definite matrix and the ma¬ 
trices Pm] and pW are positive definite with probability 
one for N sufficiently large. 


,(k + 1) _ 

i 

Ef=i hf\t) 

N 

(3) 

E+i) _ 

Ef=i h?\t)xW 


3 

Ef=i hf\t) 


fk + 1) _ 

E "=A\t)[x {t) - 

m' ] k+1 \[x < ' t l — m ( fc+1 )]T 


E LM k \t)xM 


where the posterior probabilities /iE are defined as fol¬ 
lows: 
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3 Connection between EM and 
gradient ascent 

In the followin g theorem we establish a relationship be¬ 
tween the g radient of the lo g likelihood and the step in 
parameter space taken by the EM al g orithm. In par¬ 
ticular we show that the EM step can be obtained by 
premultiplyin g the g radient by a positive definite ma¬ 
trix. We provide an explicit expression for the matrix. 

Theorem 1 At each iteration of the EM algorithm Eg. 
(3), we have 
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where A denotes the vector of mixing proportions 
[aq, • • •, cxk\ T , j indexes the mixture components (j = 
1, • • •, K), k denotes the iteration number, \ec[B]” de¬ 
notes the vector obtained by stacking the column vectors 


2 Although we focus on maximum likelihood (ML) estima¬ 
tion in this paper, it is straightforward to apply our results 
to maximum a posteriori (MAP) estimation by multiplying 
the likelihood by a prior. 


Proof. (1) We begin by considering the EM update 
for the mixing proportions ay. From Eqs. (1) and (2), 
we have 
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Premultiplying by Pjf \ we obtain 
(k) C)l , 
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The update formula for A in Eq. (3) can be rewritten as 
1 N 

A (k+i) = A (k) + Y)h[ k) (t), • • •, h (k \t)] T - A {k \ 

v t =l 

Combining the last two equations establishes the update 
rule for A (Eq. 4). Furthermore, for an arbitrary vec¬ 
tor u , we have Nu T P^u = i/ T diag[aW, • • •, a^]a — 
(u T MW) 2 . By Jensen’s inequality we have 

u T diag[E/4, • • •, a^ } ]u = ^ oEu 2 

j =i 

> (J 2 a ? )u jy 2 

j =i 

= {u T A (k) f. 

Thus, u T Pjpu > 0 and pW is positive definite given 
the constraints Ylf=i = 1 an, 4 > 0 for all j- 
(2) We now consider the EM update for the means 
mi. It follows from Eqs. (1) and (2) that 

m N 

^ t = 1 

Premultiplying by Pm] yields 
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From Eq. (3), we have Etli ^"'(f) > 0; moreover, XlE 
is positive definite with probability one assuming that N 



is large enough such that the matrix is of full rank. Thus, 
it follows from Eq. ( 8 ) that Pm) is positive definite with 
probability one. 

(3) Finally, we prove the third part of the theorem. 
It follows from Eqs. (1) and (2) that 
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Utilizing the identity vec[T5U] = (C T (g> T)vec[5], we 
obtain 
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where equality holds only when S T =0 for all U. 
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Equality is impossible, however, since E) J is positive 
dehnite with probability one N is sufhciently large. Thus 
it follows from Eq. (9) and hf{i) > 0 that P s 
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and P( 0 ) = diag[P mi , • • •, P mK , Ps,, • • •, Pz k ,Pa], we 
can combine the three updates in Theorem 1 into a single 
equation: 

0 ( * + i) = e^ + p(e^)^\ @=@(k) , (io) 

Under the conditions of Theorem 1, P(0( fc )) is a positive 
dehnite matrix with probability one. Recalling that for 

07 T 07 

a positive dehnite matrix B, we have > 0, we 

have the following corollary: 

Corollary 1 For each iteration of the EM algorithm 
given by Eg.(3), the search direction 0 ( fc+1 ) — 0 G) has 
a positive projection on the gradient of l. 


That is, the EM algorithm can be viewed as a variable 
metric gradient ascent algorithm for which the projection 
matrix PfO^) changes at each iteration as a function 
of the current parameter value ©W. 

Our results extend earlier results due to Baum and 
Sell (1968). Baum and Sell studied recursive equations 
of the following form: 


,(*+i) - 


E{x {k y) 

T(xW))i 


E(x (k y) 

[T(x^) lr --,T(x^) K ] 
) dJ/dx\ 


Ef=i xY>dj/dx\ 


„G); 


G) 


where xf > 0 
„G) 


y k 

l^i = l 


(k) 

r) ; = 1, where J is a polyno¬ 
mial in x'd > having positive coefficients. They showed 
that the search direction of this recursive formula, i.e., 
Elx^l) — x( k \ has a positive projection on the gradient 
of of J with respect to the x^l (see also Levinson, Ra- 
biner & Sondhi, 1983). It can be shown that Baum and 
Sell’s recursive formula implies the EM update formula 
for A in a Gaussian mixture. Thus, the first statement 
in Theorem 1 is a special case of Baum and Sell’s earlier 
work. However, Baum and Sell’s theorem is an existence 
theorem and does not provide an explicit expression for 
the matrix P _4 that transforms the gradient direction 
into the EM direction. Our theorem provides such an 
explicit form for P_ 4 . Moreover, we generalize Baum and 
Sell’s results to handle the updates for mj and E j , and 
we provide explicit expressions for the positive dehnite 
transformation matrices P mj and . as well. 

It is also worthwhile to compare the EM algorithm to 
other gradient-based optimization methods. Newton’s 
method is obtained by premultiplying the gradient by 
the inverse of the Hessian of the log likelihood: 
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Newton’s method is the method of choice when it can 
be applied, but the algorithm is often difficult to use 
in practice. In particular, the algorithm can diverge 
when the Hessian becomes nearly singular; moreover, 
the computational costs of computing the inverse Hes¬ 
sian at each step can be considerable. An alternative 
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is to approximate the inverse by a recursively updated 
matrix 5( fc+1 ) = B^ + r/AB^ k K Such a modification 
is called a quasi-Newton method. Conventional quasi- 
Newton methods are unconstrained optimization meth¬ 
ods, however, and must be modified in order to be used 
in the mixture setting (where there are probabilistic con¬ 
straints on the parameters). In addition, quasi-Newton 
methods generally require that a one-dimensional search 
be performed at each iteration in order to guarantee con¬ 
vergence. The EM algorithm can be viewed as a special 
form of quasi-Newton method in which the projection 
matrix P(Q^ k ^) in Eq. (10) plays the role of B^K As 
we discuss in the remainder of the paper, this partic¬ 
ular matrix has a number of favorable properties that 
make EM particularly attractive for optimization in the 
mixture setting. 


4 Constrained optimization and general 
convergence 

An important property of the matrix P is that the EM 
step in parameter space automatically satisfies the prob¬ 
abilistic constraints of the mixture model in Eq. (1). 
The domain of 0 contains two regions that embody the 
probabilistic constraints: V i = {0 : J2f= l = 1} an d 

V 2 = {0 : cx'p > 0, Ej positive definite}. For the EM 
algorithm the update for the mixing proportions ay can 
be rewritten as follows: 

(ib+i) if afPjx^mf,^) 

N hi eL 4 k) p( x(t) \ m ?\ e w ) 


It is obvious that the iteration stays within V \. Simi¬ 
larly, the update for Xy can be rewritten as: 


y(fc + l) 


1 ^ af ) P{x (t )\mf\Y!'p) 

Ef = i hf\t) tt Ef=i ] P(^\rn?\ Sf } ) 

[; x ( - t - ) — ra®][i® — 


which stays within V 2 for N sufficiently large. 

Whereas EM automatically satisfies the probabilistic 
constraints of a mixture model, other optimization tech¬ 
niques generally require modihcation to satisfy the con¬ 
straints. One approach is to modify each iterative step 
to keep the parameters within the constrained domain. 
A number of such techniques have been developed, in¬ 
cluding feasible direction methods, active sets, gradient 
projection, reduced-gradient, and linearly-constrained 
quasi-Newton. These constrained methods all incur ex¬ 
tra computational costs to check and maintain the con¬ 
straints and, moreover, the theoretical convergence rates 
for such constrained algorithms need not be the same as 
that for the corresponding unconstrained algorithms. A 
second approach is to transform the constrained opti¬ 
mization problem into an unconstrained problem before 
using the unconstrained method. This can be accom¬ 
plished via penalty and barrier functions, Lagrangian 
terms, or re-parameterization. Once again, the extra al¬ 
gorithmic machinery renders simple comparisons based 


on unconstrained convergence rates problematic. More¬ 
over, it is not easy to meet the constraints on the covari¬ 
ance matrices in the mixture using such techniques. 

A second appealing property of T 3 (©(*)) is that each 
iteration of EM is guaranteed to increase the likelihood 
(i.e., /(©(*+!)) > /(©(*))). This monotonic convergence 
of the likelihood is achieved without step-size parameters 
or line searches. Other gradient-based optimization tech¬ 
niques, including gradient descent, quasi-Newton, and 
Newton’s method, do not provide such a simple theo¬ 
retical guarantee, even assuming that the constrained 
problem has been transformed into an unconstrained 
one. For gradient ascent, the step size rj must be chosen 
to ensure that 11©C*+ 1 ) — 0( fc_1 )||/||(0l fc ) — ©( fc_1 ))|| < 
||7 + ryf?(0 ( - fc_1 ' ) ))|| < 1. This requires a one-dimensional 
line search or an optimization of rj at each iteration, 
which requires extra computation which can slow down 
the convergence. An alternative is to fix rj to a very 
small value which generally makes ||7 + r?7f (©( fc_1 )))ll 
close to one and results in slow convergence. For New¬ 
ton’s method, the iterative process is usually required 
to be near a solution, otherwise the Hessian may be in¬ 
definite and the iteration may not converge. Levenberg- 
Marquardt methods handle the indefinite Hessian ma¬ 
trix problem; however, a one-dimensional optimization 
or other form of search is required for a suitable scalar 
to be added to the diagonal elements of Hessian. Fisher 
scoring methods can also handle the indefinite Hessian 
matrix problem, but for non-quadratic nonlinear opti¬ 
mization Fisher scoring requires a stepsize rj that obeys 
||7 + y5Tf(©( fc-1 )))|| < 1, where B is the Fisher infor¬ 
mation matrix. Thus, problems similar to those of gra¬ 
dient ascent arise here as well. Finally, for the quasi- 
Newton methods or conjugate gradient methods, a one¬ 
dimensional line search is required at each iteration. In 
summary, all of these gradient-based methods incur ex¬ 
tra computational costs at each iteration, rendering sim¬ 
ple comparisons based on local convergence rates unre¬ 
liable. 

For large scale problems, algorithms that change the 
parameters immediately after each data point (“on-line 
algorithms”) are often significantly faster in practice 
than batch algorithms. The popularity of gradient de¬ 
scent algorithms for neural networks is in part to the 
ease of obtaining on-line variants of gradient descent. 
It is worth noting that on-line variants of the EM algo¬ 
rithm can be derived (Neal & Hinton, 1993, Titterington, 
1984), and this is a further factor that weighs in favor 
of EM as compared to conjugate gradient and Newton 
methods. 

5 Convergence rate comparisons 

In this section, we provide a comparative theoretical dis¬ 
cussion of the convergence rates of constrained gradient 
ascent and EM. 

For gradient ascent a local convergence result can by 
obtained by Taylor expanding the log likelihood around 
the maximum likelihood estimate 0*. For sufficiently 
large k we have: 

|| 0 (*+i) — 0*|| < ||7 + r ? 77(0*))||||(0( fc )-0*)|| (12) 



and 


Since H(Q*) is negative definite, we obtain 


||7 + V H(Q*)\\ < A M [I+ VH(0*)} = r, (13) 
where H is the Hessian of /, rj is the step size, and 
r = max{11 - rjX M [-H(Q*)]\, |1 - rjX m [-H(Q*)]\}, 
where A m [A] and A m [A] denote the largest and small¬ 
est eigenvalues of A, respectively. 

Smaller values of r correspond to faster convergence 
rates. To guarantee convergence, we require r < 1 or 
0 < rj < 2/A m[—H(Q*)\. The minimum possible value 
of r is obtained when rj = 1/A m[H(Q*)] with 

r-min = i-\ m [H(e*)\/\ M [H(e*)\ 

= 1 — K- 1 [f7(0*)], 

where k[H] = Xm[H]/X m [H] is the condition number of 
H . Larger values of the condition number correspond to 
slower convergence. When k[H] = 1 we have r msn = 0, 
which corresponds to a superlinear rate of convergence. 
Indeed, Newton’s method can be viewed as a method 
for obtaining a more desirable condition number—the 
inverse Hessian 7f _1 balances the Hessian H such that 
the resulting condition number is one. Effectively, New¬ 
ton can be regarded as gradient ascent on a new func¬ 
tion with an effective Hessian that is the identity matrix: 
H e ff = H~ 1 H = I. In practice, however, n[H] is usually 
quite large. The larger k[H] is, the more difficult it is to 
compute 7f _1 accurately. Hence it is difficult to balance 
the Hessian as desired. In addition, as we mentioned 
in the previous section, the Hessian varies from point 
to point in the parameter space, and at each iteration 
we need recompute the inverse Hessian. Quasi-Newton 
methods approximate f7(0A)) _1 by a positive matrix 
5 ^) that is easy to compute. 

The discussion thus far has treated unconstrained op¬ 
timization. In order to compare gradient ascent with 
the EM algorithm on the constrained mixture estima¬ 
tion problem, we consider a gradient projection method: 

0 (. + i) = 0 (.) + „n t _|L y ( i 4 ) 

where n*, is the projection matrix that projects the gra¬ 
dient Qo( k ) into T^i- This gradient projection iteration 
will remain in V\ as long as the initial parameter vector 
is in V\. To keep the iteration within V 2 , we choose an 
initial 0 (°) £ T >2 and keep rj sufficiently small at each 
iteration. 

Suppose that E = [ei, • • •, e m \ are a set of indepen¬ 
dent unit basis vectors that span the space V \. In this 
basis, ©A) and d Q ! (k) become 0^ = E T Q ^ k ) and 

diwTc = ET diwj’ respectively, with ||0? } - 0*|| = 

11©1*1 — 0*||. In this representation the projective gradi¬ 
ent algorithm Eq. (14) becomes simple gradient ascent: 
©(fc+i) _ q(*) _j_ rj ^\ k) . Moreover, Eq. (12) becomes 

||©(*+i) _ ©*|| < \\E tC (I + ry7f(0*))||||( 0 ( fc ) — 0*)||. As 
a result, the convergence rate is bounded by 

r c = \\ E T (I + ijH(Q*))\\ _ 

< ^A m [E t (I + ijH(e*))(I + ijH(e*)) T E] 

= y/X m [E t (I + 2 ijH(e*) + ry 2 i7 2 (0*))A]. 


r c < \Jl + ij 2 X 2 m [-H c ] - 2r]X m [-H c \. (15) 

In this equation H c = E T H(Q)E is the Hessian of l 
restricted to V\. 

We see from this derivation that the convergence 
speed depends on k[H c ] = Xm[~ H c ]/X m [— H c ], When 
k[H c ] = 1, we have y/l + t] 2 X 2 m (-H c ) - 2r]X m [-H c ] = 
1 — r)X[—H c ], which in principle can be made to equal 
zero if rj is selected appropriately. In this case, a super- 
linear rate is obtained. Generally, however, k[H c ] 1, 
with smaller values of k[H c ] corresponding to faster con¬ 
vergence. 

We now turn to an analysis of the EM algorithm. As 
we have seen EM keeps the parameter vector within V\ 
automatically. Thus, in the new basis the connection 
between EM and gradient ascent (cf. Eq. (10)) becomes 

©G+ 1 ) = ©(*) + E T P{Q {k y)^ 

and we have 

HeG+i) _ e*|| < \\ e t (i + J p J ff(©*))||||(©( fc ) - 0*)|| 

with 

r c = \\E T (I + PH(e*))\\ 

< y/x M [E T (I + PH(0*))(I + PH(0*)) T E], 

The latter equation can be further manipulated to yield: 

r c < yj 1 + X 2 m [E t PHE] - 2A m [-E T PHE]. (16) 

Thus we see that the convergence speed of EM de¬ 
pends on k[E t PHE] = X M [E T PHE]/X m [E T PHE]. 
When k[E^PHE] = 1, A M [E T PHE] = 1, we 

have y/1 + X 2 m [E t PHE] - 2A m [—E T PHE] =(1- 
A m[~E t PHE]) = 0. In this case, a superlinear rate 
is obtained. We discuss the possibility of obtaining su¬ 
perlinear convergence with EM in more detail below. 

These results show that the convergence of gradient 
ascent and EM both depend on the shape of the log likeli¬ 
hood as measured by the condition number. When k[H] 
is near one, the configuration is quite regular, and the 
update direction points directly to the solution yielding 
fast convergence. When k[H] is very large, the l sur¬ 
face has an elongated shape, and the search along the 
update direction is a zigzag path, making convergence 
very slow. The key idea of Newton and quasi-Newton 
methods is to reshape the surface. The nearer it is to a 
ball shape (Newton’s method achieves this shape in the 
ideal case), the better the convergence. Quasi-Newton 
methods aim to achieve an effective Hessian whose con¬ 
dition number is as close as possible to one. Interest¬ 
ingly, the results that we now present suggest that the 
projection matrix P for the EM algorithm also serves 
to effectively reshape the likelihood yielding an effective 
condition number that tends to one. We first present 
empirical results that support this suggestion and then 
present a theoretical analysis. 
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Figure 1: Experimental results for the estimation of the 
parameters of a two-component Gaussian mixture, (a) 
The condition numbers as a function of the iteration 
number, (b) A zoomed version of (a) after discarding 
the first 25 iterations. The terminology ‘original, con¬ 
strained, and EM-equivalent. Hessians’ refers to the ma¬ 
trices H,E T HE, and E T PHE respectively. 


We sampled 1000 points from a simple finite mixture 
model given by 

p{x) = aipi{x) + a 2 p 2 {x) 


where 


Pi(x) 


1 

rof 


exp{ 


1 (x — mi) 2 

2 V 2 


}• 


The parameter values were as follows: aq = 

0.7170, q-2 = 0.2830, mi = —2, m 2 =2, a'f = 1, a 2 = 
1. We ran both the EM algorithm and gradient ascent 
on the data. At each step of the simulation, we calcu¬ 
lated the condition number of the Hessian (n [H(Q ik] )]), 
the condition number determining the rate of conver¬ 
gence of the gradient algorithm (k[E t HiO^'^E]), and 
the condition number determining the rate of conver¬ 
gence of EM (k[E t P(Q {k) )H(Q {k] )E]). We also, calcu¬ 
lated the largest eigenvalues of the matrices 77(0* fc) ), 
E T H{Q (k] )E , and E T P{e {k] )H{Q {k] )E . The results 
are shown in Fig. 1. As can be seen in Fig. 1(a), the con¬ 
dition numbers change rapidly in the vicinity of the 25th 
iteration and the corresponding Hessian matrices be¬ 
come indefinite. Afterward, the Hessians quickly become 
definite and the condition numbers converge. 3 As shown 
in Fig. 1(b), the condition numbers converge toward the 
values n[H(Q ik) )] = 47.5, k[E t H(Q ik) )E] = 33.5, and 
k[E t P(Q {k] )H(Q {k] )E] = 3.6. That is, the matrix P 
has greatly reduced the condition number, by factors of 
9 and 15. This significantly improves the shape of l and 
speeds up the convergence. 

We ran a second experiment in which the means of the 
component Ganssians were mi = — 1 and m 2 = 1. The 
results are similar to those shown in Fig. 1. Since the 
distance between two distributions is reduced into half, 


3 Interestingly, the EM algorithm converges soon afterward 
as well, showing that for this problem EM spends little time 
in the region of parameter space in which a local analysis is 
valid. 



Figure 2: Experimental results for the estimation of the 
parameters of a two-component. Gaussian mixture (cf. 
Fig. 1). The separation of the Ganssians is half the 
separation in Fig. 1. 




(a) (b) 

Figure 3: The largest eigenvalues of the matrices 
77, E T HE, and E T PHE plotted as a function of the 
number of iterations. The plot in (a) is for the experi¬ 
ment in Fig. 1; (b) is for the experiment reported in Fig. 
2 . 


the shape of l becomes more irregular. The condition 
number K[T7(0 (fc) )] increases to 352, k[E t H(Q {k 1 )E] in¬ 
creases to 216, and k[E t P(Q {k] )H(Q {k] )E] increases to 
61. We see once again a significant, improvement, in the 
case of EM, by factors of 4 and 6. 

Fig. 3 shows that, the matrix P has also reduced the 
largest, eigenvalues of the Hessian from between 2000 to 
3000 to around 1. This demonstrates clearly the sta¬ 
ble convergence that, is obtained via. EM, without, a. line 
search or the need for external selection of a. learning 
st.epsize. 

In the remainder of the paper we provide some theo¬ 
retical analyses that, attempt, to shed some light, on these 
empirical results. To illustrate the issues involved, con¬ 
sider a. degenerate mixture problem in which the mixture 
has a. single component.. (In this case aq = 1.) Let. us fur¬ 
thermore assume that, the covariance matrix is fixed (i.e., 
only the mean vector m is to be estimated). The Hes¬ 
sian with respect, to the mean m is 77 = — TVS -1 and the 
EM projection matrix P is E/N. For gradient, ascent., we 
have k[E t HE] = k[S - 1 ], which is larger than one when¬ 
ever E yl c7. EM, on t.hfc,other hand, achieves a. condi¬ 
tion number of one exactly ( k[E t PHE] = k[P 77] = 
k[ 7] = 1 and \m[E t PHE] = 1). Thus, EM and New- 
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ton’s method are the same for this simple quadratic 
problem. For general non-quadratic optimization prob¬ 
lems, Newton retains the quadratic assumption, yield¬ 
ing fast convergence but possible divergence. EM is 
a more conservative algorithm that retains the conver¬ 
gence guarantee but also maintains quasi-Newton be¬ 
havior. We now analyze this behavior in more detail. 
We consider the special case of estimating the means in 
a Gaussian mixture when the Gaussians are well sepa¬ 
rated. 

Theorem 2 Consider ihe EM algorithm in Eg. (3), 
where the parameters ay and Xj are assumed to be 
known. Assume that the K Gaussian distributions are 
well separated, such that for sufficiently large k the pos¬ 
terior probabilities h^ k \t) are nearly zero or one. For 
such k, the condition number associated with EM is al¬ 
ways smaller than the condition number associated with 
gradient ascent. That is: 

k[E t P{Q ( k) )H{Q (k) )E\ < k[E t H( e (k) )E]. 

Furthermore, \m[E t P(Q < ' k l)E[(Q < ' k l)E\ approach es one 
as k goes to infinity. 


Proof. The Hessian is 


where 


H = 


d 2 l 


[Hu 

H 12 • 

■ ■ Hik " 

H 21 

H 22 ■ 

• • H 2 k 

.Hki 

Hr2 ■ 

•• Hkk. 


dm; dmf 


(17) 


(18) 


N 


t = l 

N 

\^2lij{x (t) ){x (t) - m i )(* (t] ~ mi) T ]{Tf ) )“ 1 


t = 1 


with = (Sfj — h\ \t))hj \t). The projection 

matrix P is 

pC) = diag [ P G) ; ... ; ^G ]; 


where 


■y(k) 

pif) — 3 

jj t~^,N "'3 


E l\ 

t =1 




Given that h^ k \t)( 1 — h^ k \t)) is negligible for suffi¬ 
ciently large k, the second term in Eq. (19) can be 
neglected, yielding Ha = -(X^) -1 J2t=i hf \t) and 
E[ = diagfiJn, • • •, Hkk]- This implies that PE[ = —I, 
and thus k[PE[] = 1, whereas k[E[\ fz 1. □ 


This theorem, although restrictive in its assumptions, 
gives some indication as to why the projection matrix 
in the EM algorithm appears to condition the Hessian, 
yielding improved convergence. In fact, we conjecture 


that the theorem can be extended to apply more widely, 
in particular to the case of the full EM update in which 
the mixing proportions and covariances are estimated, 
and also, within limits, to cases in which the means are 
not well separated. To obtain an initial indication as to 
possible conditions that can be usefully imposed on the 
separation of the mixture components, we have stud¬ 
ied the case in which the second term in Eq. (19) is 
neglected only for Ha and is retained for the iJy com¬ 
ponents, where j fz i. Consider, for example, the case 
of a univariate mixture having two mixture components. 
For fixed mixing proportions and fixed covariances, the 
Hessian matrix (Eq. 17) becomes: 

tt _ hn h-12 

h-zi h 2 2 ’ 


and the projection matrix (Eq. 19) becomes: 
P = 


-hn 1 0 
0 —hr, 


where 


and 


hi i — 


- 2 (k) XT' * —1,2 


N 


2 (k) 

>i t = 1 


N 


H 3 ~ 2 (k) 2 (k) 

<V a j t = 1 


y^(l —h\ k \f))h^\f)(x < ' t ' , —mj) T (x < ' t ' , —m 


for i fz j = 1, 2. If H is negative definite, (i.e., hnh 2 2 — 
hi 2 h 2 i < 0), then we can show that the conclusions of 
Theorem 2 remain true, even for Gaussians that are not 
necessarily well-separated. The proof is achieved via the 
following lemma: 

Lemma 1 Consider the positive definite matrix 


j, _ Cll Cl2 

L cr 21 C 22 

For the diagonal matrix B = diagfun 1 , c^ 1 ], we have 
k[5X] < k[X], 


Proof. The eigenvalues of X are the roots of (an — 
A)((T 2 2 - A) - (72i(7i2 = 0, which gives 


A m 

A m 


1 

and 


<7ll + (722 + 7 
2 

<7ll + (722 ~ 7 
2 

V( a ll + cr 22) 2 — 4((7 i 1(722 — (721(712) 


«[X] 


<7ll + (722 + 7 
<7ll + <722 — 7 


The condition number k[X] can be written as k[X] = 
(1 + s)/(l — s) = f(s), where s is defined as follows: 


s = 


1 - 


4((7i 1(722 — (721(712) 


7 


((7n + <722) 2 



Furthermore, the eigenvalues of BY, are the roots 
of (1 - A)(l - A) - (< 7 2 i< 7 i 2 )/(<tii< 722 ) = 0, which 
gives A m = 1 + \J{cr 21 ^ 12 )! {^ 11 ^ 22 ) and A m = 

1 - \/{cr 21 ^ 12 )/{crucr 22 )■ Thus, dehning r = 

\f{v 2 ivi 2 )/{vii(? 22 ), we have k[ 5S] = (1 + r)/( 1 - r) = 

f(r). 

We now examine the quotient s/r: 

s _ 1 r 4(1 - r 2 ) 

r r Y ((Tn + (T22) 2 /((Tll(T22) 

Given that (<rn + (T 22) 2 / {^ 11 ^ 22 ) > 4, we have ^ > 
— (1 — r 2 ) = 1. That is, s > r. Since /(*) = 
(l + *)/(l — x) is a monotonically increasing function for 
x > 0, we have f(s ) > /(r). Therefore, k[5S] < «[£]. 
□ 

We think that it should be possible to generalize 
this lemma beyond the univariate, two-component case, 
thereby weakening the conditions on separability in The¬ 
orem 2 in a more general setting. 


Hessian of the EM algorithm tends toward one, showing 
that EM can approximate a superlinear method. Finally, 
in cases of a poorly conditioned Hessian, superlinear con¬ 
vergence is not necessarily a virtue. In such cases many 
optimization schemes, including EM, essentially revert 
to gradient ascent. 

We feel that EM will continue to play an important 
role in the development of learning systems that empha¬ 
size the predictive aspect of data modeling. EM has in¬ 
deed played a critical role in the development of hidden 
Markov models (HMM’s), an important example of pre¬ 
dictive data modeling . 4 EM generally converges rapidly 
in this setting. Similarly, in the case of hierarchical mix¬ 
tures of experts the empirical results on convergence in 
likelihood have been quite promising (Jordan & Jacobs, 
1994; Waterhouse & Robinson, 1994). Finally, EM can 
play an important conceptual role as an organizing prin¬ 
ciple in the design of learning algorithms. Its role in this 
case is to focus attention on the “missing variables” in 
the problem. This clarifies the structure of the algorithm 
and invites comparisons with statistical physics, where 
missing variables often provide a powerful analytic tool. 


6 Conclusions 

In this paper we have provided a comparative analysis 
of algorithms for the learning of Gaussian mixtures. We 
have focused on the EM algorithm and have forged a link 
between EM and gradient methods via the projection 
matrix P. We have also analyzed the convergence of 
EM in terms of properties of the matrix P and the effect 
that P has on the likelihood surface. 

EM has a number of properties that make it a par¬ 
ticularly attractive algorithm for mixture models. It en¬ 
joys automatic satisfaction of probabilistic constraints, 
monotonic convergence without the need to set a learn¬ 
ing rate, and low computational overhead. Although EM 
has the reputation of being a slow algorithm, we feel 
that in the mixture setting the slowness of EM has been 
overstated. Although EM can indeed converge slowly 
for problems in which the mixture components are not 
well separated, the Hessian is poorly conditioned for 
such problems and thus other gradient-based algorithms 
(including Newton’s method) are also likely to perform 
poorly. Moreover, if one’s concern is convergence in like¬ 
lihood, then EM generally performs well even for these 
ill-conditioned problems. Indeed the algorithm provides 
a certain amount of safety in such cases, despite the poor 
conditioning. It is also important to emphasize that 
the case of poorly separated mixture components can 
be viewed as a problem in model selection (too many 
mixture components are being included in the model), 
and should be handled by regularization techniques. 

The fact that EM is a first order algorithm certainly 
implies that EM is no panacea, but does not imply that 
EM has no advantages over gradient ascent or superlin¬ 
ear methods. First, it is important to appreciate that 
convergence rate results are generally obtained for un¬ 
constrained optimization, and are not necessarily indica¬ 
tive of performance on constrained optimization prob¬ 
lems. Also, as we have demonstrated, there are condi¬ 
tions under which the condition number of the effective 
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